Load Libraries and Data

library(tidyverse)
library(sf)
library(lubridate)
library(tidytext)
library(scales)
library(viridis)

# Load the single, final clean file
inspections <- read_csv("data/inspections_clean.csv")

EDA Report

The goal of this section is to understand the overall shape of our data.

Basic Plots

# Inspections by City
ggplot(inspections, aes(x = city_source)) + geom_bar() + labs(title = "Inspections by City") + theme_minimal()


#Outcomes
ggplot(inspections, aes(x = inspection_outcome)) + 
  geom_bar() +
  scale_y_continuous(labels = label_comma()) +
  labs(title = "Outcomes")


#Risks
ggplot(inspections, aes(x = risk_level_standard)) + 
  labs(title = "Risks") + 
  scale_y_continuous(labels = label_comma()) + 
  geom_bar()

  
#Time Range of the Data
ggplot(inspections, aes(x = inspection_date)) + geom_histogram(bins = 50) + 
scale_y_continuous(labels = label_comma()) +
  labs(title = "Distribution of Inspections Over Time", x = "Date", y = "Count")

Our initial phase of exploratory data analysis (EDA) focused on understanding the overall shape, distribution, and integrity of our combined 1.4 million-row dataset. This initial analysis confirms our combined dataset is large, usable, and modern, but critically imbalanced.

Plot 1: Inspections by City This bar chart shows the total number of inspections contributed by each city. The key finding here is that our dataset is heavily imbalanced. Boston is the source of a large majority of our data (over 750,000 rows), while Chicago and New York are similarly sized, and San Francisco is our smallest dataset.

This is a critical finding because it confirms that we cannot use raw counts to compare cities. All future city-to-city comparisons must be based on proportions or percentages (using position = “fill”) to be fair and accurate.

Plot 2: Overall Inspection Outcomes This plot shows the distribution of our newly-created inspection_outcome column across the entire dataset. While “Fail” and “Pass” are the most frequent outcomes, this plot is heavily skewed by the disproportionate size of the Boston data. We can’t draw any real conclusions from this yet. Its main purpose for now is to confirm that our case_when() logic successfully categorized all 1.4 million rows.

Plot 3: Distribution of Inspections Over Time This histogram is one of our most important initial plots. It confirms two things:

The data is recent: The vast majority of our inspections occurred from roughly 2005 to the present, which is excellent for answering our research questions about modern trends.

We have “bad data” to filter: The plot clearly identifies a small cluster of “bad data” (placeholder dates from 1900) that we must filter out before conducting any time-series analysis. The warning message also confirmed that a negligible number of rows (~6,400, or <0.5%) were dropped due to missing dates, which will not impact our analysis.

City Comparisons

#Outcomes by City
ggplot(inspections, aes(x = city_source, fill = inspection_outcome)) + 
  geom_bar(position = "fill") + 
  labs(title = "Outcomes by City", y = "Proportion")


#Risk Level by City
ggplot(inspections, aes(x = city_source, fill = risk_level_standard)) + 
  geom_bar(position = "fill") +
    labs(title = "Risk Level by City", y = "Proportion")


#Score Distributions
ggplot(inspections, aes(x = city_source, y = inspection_score)) + 
  geom_boxplot() +
    labs(title = "Score Distributions")


ggplot(inspections, aes(x = inspection_outcome, y = inspection_score, fill = city_source)) +
  geom_boxplot() + 
  facet_wrap(~ city_source) + #show how our standardized Pass/Fail categories line up with the numeric scores
    labs(title = "Score Distributions by City (Standardized")

After establishing the overall shape of our data, we drilled down to compare the four cities. This is where our standardization efforts provide their first major insights, revealing that the cities have vastly different inspection profiles.

Plot 4: Proportions of Outcomes by City This plot directly compares the pass/fail rates for each city. We found that Boston and San Francisco have the highest proportion of “Fail” outcomes. In contrast, New York’s profile is dominated by “Pass” (reflecting their ‘A’ grades) and “Pending” results. This confirms that each city’s grading system is unique and that our inspection_outcome column successfully captured these differences.

Plot 5: Proportion of Risk Level by City This is one of the most striking findings so far. Chicago and New York log a dramatically higher proportion of “High Risk” violations than Boston and San Francisco. In Chicago, “High Risk” appears to be the most common category, while in Boston, “Low Risk” is dominant. This may reflect different public health definitions of “risk” or different areas of focus for each city’s inspectors.

Plot 6 & 7: Score Distributions These two plots (which we’ve combined for analysis) confirm our inspection_outcome logic and reveal the inverted nature of the scoring systems.

NYC uses a demerit system: A lower score is better. Our plot shows that “Pass” (Grade A) has the lowest scores, while “Fail” (Grade C) has the highest.

SF uses a points-based system: A higher score is better. Our plot confirms this, showing “Pass” (90+) has the highest scores and “Fail” (71-85) has lower scores.

This analysis validates our standardization and proves that the two scoring systems are direct opposites.

Exploring Fail Rates Over Time


# Create a summary table
fails_by_year <- inspections |>
  filter(inspection_date > "2000-01-01") |> # filter out 'bad' dates
  mutate(year = year(inspection_date)) |>
  group_by(city_source, year) |>
  summarize(
    fail_rate = mean(inspection_outcome == "Fail", na.rm = TRUE),
    total_inspections = n()
  ) |>
  filter(total_inspections > 100) # Remove years with tiny data

# Plot the summary
ggplot(fails_by_year, aes(x = year, y = fail_rate, color = city_source)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  scale_y_continuous(labels = label_percent()) +     # format y-axis as percentage
  scale_color_viridis_d() +                         # add better colors
  labs(
    title = "Fail Rate Over Time by City", 
    y = "Proportion of Fails",                      # <-- Fix y-axis label
    x = "Year",
    color = "City"                                  # <-- Fix legend title
  )


# Least Successful zipcodes in each city in the dataset
fails_by_zip <- inspections |>
  filter(inspection_date > "2000-01-01") |> # Good idea to filter bad dates
  group_by(city_source, zip_code) |>
  summarize(
    fail_rate = mean(inspection_outcome == "Fail", na.rm = TRUE),
    total_inspections = n()
  ) |>
  filter(total_inspections > 100) |> # Only include zips with 100+ inspections
  ungroup()

# Summary table for zip codes
fails_by_zip |>
  group_by(city_source) |>      # Group by city
  top_n(3, fail_rate) |>        # Find the top 3 *within* that city
  ungroup() |>
  
  # This 'reorder_within' is a special trick to make facets work
  mutate(
    zip_code = tidytext::reorder_within(zip_code, fail_rate, city_source)
  ) |>

  ggplot(aes(x = zip_code, y = fail_rate, fill = city_source)) +
  geom_col() +
  tidytext::scale_x_reordered() + # This makes the labels plot correctly
  facet_wrap(~ city_source, scales = "free") + #Create 4 separate charts
  coord_flip() +
  labs(
    title = "Top 3 Zip Codes with Highest Fail Rate by City",
    x = "Zip Code",
    y = "Fail Rate"
  ) +
  theme(legend.position = "none")

Following our city-level comparisons, we moved into our first multivariate analysis to answer our research questions about where and when violations occur.

Plot 8: Fail Rate Over Time by City This line graph plots the proportion (percentage) of “Fail” outcomes for each city, from 2008 to 2025. This is our most significant finding so far:

Dramatically Different Systems: The definition of a “Fail” is clearly not standardized.

Boston: Has a very high and consistent “Fail” rate, hovering around 60% for the entire 15-year period. This strongly suggests that Boston’s result column logs a “Fail” for even minor infractions, making it a very common outcome.

Chicago: Shows a positive trend. Its fail rate has steadily decreased from ~25% in 2010 to a stable ~18-20% in recent years.

New York: Has an extremely low fail rate, consistently staying below 10%. This aligns perfectly with their “Grade A” system, where a ‘C’ (which we mapped to “Fail”) is rare.

San Francisco: Appears to have a fail rate of around 30%, landing between Boston and the other two cities.

Plot 9: Top 3 Worst Zip Codes by City This bar chart was intended to find the “worst” zip codes across the entire dataset. We have now successfully identified specific geographic areas of concern for each city:

Boston: 02119-3212, 02120

Chicago: 60827, 60619

New York: 11436, 11428

San Francisco: 94134, 94118

Boston Zipcodes Map

There are 43 zip codes in the City of Boston

Top Violations Overall

top_bigrams <- inspections |>
  sample_frac(0.5) |>  # Taking a 50% random sample to reduce memory load
  filter(!is.na(violation_desc)) |>
  unnest_tokens(bigram, violation_desc, token = "ngrams", n = 2) |>
  count(bigram, sort = TRUE)

# Plot the top 20
top_bigrams |>
  top_n(20) |>
  ggplot(aes(x = reorder(bigram, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Top 20 Most Common Violation Phrases (50% Sample)",
    x = "Violation Phrase (Bigram)",
    y = "Total Count"
  )

This plot, despite successfully processing the text, shows that the most common phrases across all cities are generic, high-frequency functional words like ‘food contact’ and ‘non food’. We do not see specific issues like ‘rodent’ or ‘improper temperature’ in the top 20, confirming that these key issues are diluted by common, general phrases. This validates our next step to use manual keyword searching to group thousands of unique violations into meaningful categories.

Days of the week that see the most inspections

inspections |> 
  mutate(wday = wday(inspection_date, label = TRUE)) |> 
  count(wday) |>
  ggplot(aes(wday, n)) +
  geom_col() +
  scale_y_continuous(labels = label_comma()) +
  labs(title = "Inspections by Day of Week")

This analysis confirms that inspection frequency is heavily skewed toward weekdays, with a slight peak on Tuesday. The low counts on Sunday and Saturday suggest that weekend inspections are rare or reserved only for specific facilities.

Inspection Frequency per Restaurant

ggplot(inspections |> count(restaurant_name), aes(n)) +
  geom_histogram(breaks = seq(0, 50, by = 1), color="black", fill="skyblue") +
  labs(title="Distribution of Inspections per Restaurant", x="Number of Inspections", y="Number of Restaurants")

The tallest bar shows that nearly 10,000 restaurants have been inspected only once (or close to once) during the dataset’s time frame. The number of restaurants drops off steeply for businesses inspected just 2, 3, or 4 times. The long, flat tail on the right is highly significant. While most businesses are inspected infrequently, a small but consistent number of restaurants are inspected 20, 30, 40, or even 50+ times. This may be due to routine inspections happening, but this is unlikely for the time period covered in our data. This coudl also suggest that the inspection system is effectively using a risk-based approach. Restaurants that fail or have serious violations are likely pushed into re-inspection cycles or receive more frequent routine inspections. Conversely, successful restaurants are inspected less often, freeing up resources.

Answering Research Questions

Seun’s Analysis

Categorize Violations

inspections_categorized <- inspections |>
  mutate(
    violation_category = case_when(
      str_detect(violation_desc, "mice|rodent|droppings|insects|pests|harborage") ~ "Pest Control",
      str_detect(violation_desc, "temperature|cool|hot-holding|cold-holding|refrigerat") ~ "Food Temperature",
      str_detect(violation_desc, "wash|sanitiz|clean|wiping cloths|dishwashing") ~ "Sanitation & Cleaning",
      str_detect(violation_desc, "hand wash|hygiene|no hair|bare hand") ~ "Employee Hygiene",
      str_detect(violation_desc, "contaminat|cross-contamination|raw|covered|protect") ~ "Cross-Contamination",
      str_detect(violation_desc, "floor|wall|ceiling|plumbing|ventilation|garbage") ~ "Facility/Maintenance",
      TRUE ~ "Other"
    )
  )

#Check new categories
inspections_categorized |>
  count(violation_category, sort = TRUE)

Categorize Restaurants

inspections_categorized <- inspections_categorized |>
  mutate(
    # Create one combined description column to search
    desc_all = coalesce(cuisine_description, facility_type, descript) |> tolower(),
    
    restaurant_category = case_when(
      str_detect(desc_all, "pizza|pizzeria") ~ "Pizza",
      str_detect(desc_all, "sushi|japanese") ~ "Sushi/Japanese",
      str_detect(desc_all, "coffee|cafe|bakery|donut") ~ "Coffee/Bakery",
      str_detect(desc_all, "mexican|taco") ~ "Mexican",
      str_detect(desc_all, "chinese") ~ "Chinese",
      str_detect(desc_all, "sandwich|deli|sub") ~ "Sandwich/Deli",
      str_detect(desc_all, "bar|tavern|pub") ~ "Bar/Pub",
      str_detect(desc_all, "restaurant|american") ~ "Restaurant (General)",
      str_detect(desc_all, "school|grocery|market|caterer|kitchen") ~ "Other (Grocery/School/Etc.)",
      TRUE ~ "Other"
    )
  )

#Check new categories
inspections_categorized |>
  count(restaurant_category, sort = TRUE)

Violation “Fingerprint” Plot

#Summary table
violation_fingerprint <- inspections_categorized |>
  filter(
    restaurant_category != "Other",
    restaurant_category != "Other (Grocery/School/Etc.)",
    violation_category != "Other"
  ) |>
  count(restaurant_category, violation_category) |>
  group_by(restaurant_category) |>
  mutate(proportion = n / sum(n)) |>
  ungroup()

# Plot the "fingerprint"
ggplot(violation_fingerprint, 
       aes(x = violation_category, y = proportion, fill = violation_category)) +
  geom_col() +
  facet_wrap(~ restaurant_category) + 
  scale_y_continuous(labels = label_percent()) +
  theme(
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    legend.position = "bottom"
  ) +
  labs(
    title = "Violation 'Fingerprint' by Restaurant Category",
    y = "Proportion of All Violations",
    x = "Violation Type",
    fill = "Violation Category"
  )

Restaurant categories do disproportionately receive certain violation types. While “Sanitation & Cleaning” is a universal problem, specific models carry unique risks. The most important takeaway is that Sanitation & Cleaning (pink) and Pest Control (blue) are the two largest violation categories across nearly all restaurant types. However, the proportion of other key categories reveals important differences:

Pest Control Dominance:

Sushi/Japanese, Sandwich/Deli, and Bar/Pubs have the highest proportion of Pest Control violations, consistently making up over 25% of their total issues.

Sanitation & Cleaning Focus:

Coffee/Bakery, Pizza, and Chinese restaurants are highly dominated by Sanitation & Cleaning violations (pink), with this category often exceeding 40% of their total violations.

Food Temperature Risk (The Outlier):

Restaurant (General) is the only category where Food Temperature (teal) makes up the largest non-Sanitation/Pest share, hitting approximately 25–30% of its violations. This suggests that businesses with diverse menus and complex cooking processes are uniquely prone to temperature-related failures.

Cross-Contamination Risk:

Mexican and Sandwich/Deli restaurants show a slightly elevated proportional risk for Cross-Contamination (red), often related to the high volume of raw preparation (e.g., cutting vegetables, preparing salsas, handling deli meats).

Standardize Inspection Types

inspections_categorized <- inspections_categorized |>
  mutate(
    type_lower = tolower(inspection_type),
    
    inspection_type_standard = case_when(
      # Keywords for complaint
      str_detect(type_lower, "complaint|311|owner complaint") ~ "Complaint",
      
      # Keywords for routine
      str_detect(type_lower, "routine|cycle|canvass|regular") ~ "Routine",
      
      # Keywords for follow-ups
      str_detect(type_lower, "re-inspection|reinspection|follow-up") ~ "Follow-Up",
      
      TRUE ~ "Other"
    )
  )

#new standardized categories
inspections_categorized |>
  count(inspection_type_standard, sort = TRUE)

Plot Severity by Inspection Type


inspections_categorized |>
  # 1. Filter the data (No changes needed here)
  filter(
    inspection_type_standard %in% c("Routine", "Complaint"),
    risk_level_standard != "Unknown"
  ) |> # <--- PIPE TO THE NEXT STEP
  
  # 2. Mutate (Convert to Factor)
  mutate(
    risk_level_standard = factor(
      risk_level_standard, 
      # Set the desired hierarchical order
      levels = c("High", "Medium", "Low"),
      ordered = TRUE
    )
  ) |> # <--- CORRECT PIPE PLACEMENT (Outside the final mutate parenthesis)
  
  # 3. Plot proportions
  ggplot(aes(x = inspection_type_standard, fill = risk_level_standard)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::label_percent()) +
  scale_fill_viridis_d(option = "C") +
  labs(
    title = "Violation Severity by Inspection Type",
    x = "Inspection Type",
    y = "Proportion of Violations",
    fill = "Risk Level"
  )

The proportional plot clearly shows that Complaint-Driven inspections are more likely to find severe problems than Routine inspections12. The proportion of High Risk violations is visually larger in the Complaint column than in the Routine column13. This suggests the public is effective at reporting restaurants that pose an immediate public health risk.

Plot Violation Type by Inspection Type

inspections_categorized |>
  filter(
    inspection_type_standard %in% c("Routine", "Complaint"),
    violation_category != "Other" # Remove 'Other' violations
  ) |>
  
  # Plot proportions
  ggplot(aes(x = inspection_type_standard, fill = violation_category)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::label_percent()) +
  scale_fill_viridis_d() +
  labs(
    title = "Violation Types by Inspection Type",
    x = "Inspection Type",
    y = "Proportion of Violations",
    fill = "Violation Category"
  ) +
  coord_flip() # Flip axes to make categories easier to read

When looking at violation type, Complaint inspections are disproportionately focused on Pest Control and Food Temperature issues. In contrast, Routine inspections find a much higher proportion of Sanitation & Cleaning and Facility/Maintenance problems. This suggests that patrons are quick to report visible signs of pests or temperature issues, while inspectors find more structural/hygiene problems during unannounced, comprehensive checks.

Our analysis confirms two critical relationships:

  1. Business Model Risk: The type of cuisine a restaurant serves creates a unique “fingerprint” of health risks (Q1). Complex cooking (Restaurant General) correlates with Food Temperature issues, while high-volume raw preparation (Sushi/Deli) correlates with Pest Control and Cross-Contamination.
  2. Inspection Type Bias: Public complaints (Q2) are highly effective at flagging immediate public health threats, demonstrating a higher proportional incidence of High Risk violations and Pest Control problems compared to routine inspections.
---
title: "R Students Project Notebook"
output:
  html_document:
    df_print: paged
  pdf_document: default
  html_notebook: default
---

### Load Libraries and Data
```{r setup, message=FALSE, warning=FALSE}
library(tidyverse)
library(sf)
library(lubridate)
library(tidytext)
library(scales)
library(viridis)

# Load the single, final clean file
inspections <- read_csv("data/inspections_clean.csv")
```

## EDA Report
The goal of this section is to understand the overall shape of our data.

### Basic Plots
```{r Basic Plots}
# Inspections by City
ggplot(inspections, aes(x = city_source)) + geom_bar() + labs(title = "Inspections by City") + theme_minimal()

#Outcomes
ggplot(inspections, aes(x = inspection_outcome)) + 
  geom_bar() +
  scale_y_continuous(labels = label_comma()) +
  labs(title = "Outcomes")

#Risks
ggplot(inspections, aes(x = risk_level_standard)) + 
  labs(title = "Risks") + 
  scale_y_continuous(labels = label_comma()) + 
  geom_bar()
  
#Time Range of the Data
ggplot(inspections, aes(x = inspection_date)) + geom_histogram(bins = 50) + 
scale_y_continuous(labels = label_comma()) +
  labs(title = "Distribution of Inspections Over Time", x = "Date", y = "Count")
```
Our initial phase of exploratory data analysis (EDA) focused on understanding the overall shape, distribution, and integrity of our combined 1.4 million-row dataset. This initial analysis confirms our combined dataset is large, usable, and modern, but critically imbalanced.

Plot 1: Inspections by City
This bar chart shows the total number of inspections contributed by each city. The key finding here is that our dataset is heavily imbalanced. Boston is the source of a large majority of our data (over 750,000 rows), while Chicago and New York are similarly sized, and San Francisco is our smallest dataset.

This is a critical finding because it confirms that we cannot use raw counts to compare cities. All future city-to-city comparisons must be based on proportions or percentages (using position = "fill") to be fair and accurate.

Plot 2: Overall Inspection Outcomes
This plot shows the distribution of our newly-created inspection_outcome column across the entire dataset. While "Fail" and "Pass" are the most frequent outcomes, this plot is heavily skewed by the disproportionate size of the Boston data. We can't draw any real conclusions from this yet. Its main purpose for now is to confirm that our case_when() logic successfully categorized all 1.4 million rows.

Plot 3: Distribution of Inspections Over Time
This histogram is one of our most important initial plots. It confirms two things:

The data is recent: The vast majority of our inspections occurred from roughly 2005 to the present, which is excellent for answering our research questions about modern trends.

We have "bad data" to filter: The plot clearly identifies a small cluster of "bad data" (placeholder dates from 1900) that we must filter out before conducting any time-series analysis. The warning message also confirmed that a negligible number of rows (~6,400, or <0.5%) were dropped due to missing dates, which will not impact our analysis.


### City Comparisons
```{r City Comparisons}
#Outcomes by City
ggplot(inspections, aes(x = city_source, fill = inspection_outcome)) + 
  geom_bar(position = "fill") + 
  labs(title = "Outcomes by City", y = "Proportion")

#Risk Level by City
ggplot(inspections, aes(x = city_source, fill = risk_level_standard)) + 
  geom_bar(position = "fill") +
    labs(title = "Risk Level by City", y = "Proportion")

#Score Distributions
ggplot(inspections, aes(x = city_source, y = inspection_score)) + 
  geom_boxplot() +
    labs(title = "Score Distributions")

ggplot(inspections, aes(x = inspection_outcome, y = inspection_score, fill = city_source)) +
  geom_boxplot() + 
  facet_wrap(~ city_source) + #show how our standardized Pass/Fail categories line up with the numeric scores
    labs(title = "Score Distributions by City (Standardized")

```
After establishing the overall shape of our data, we drilled down to compare the four cities. This is where our standardization efforts provide their first major insights, revealing that the cities have vastly different inspection profiles.

Plot 4: Proportions of Outcomes by City 
This plot directly compares the pass/fail rates for each city. We found that Boston and San Francisco have the highest proportion of "Fail" outcomes. In contrast, New York's profile is dominated by "Pass" (reflecting their 'A' grades) and "Pending" results. This confirms that each city's grading system is unique and that our inspection_outcome column successfully captured these differences.

Plot 5: Proportion of Risk Level by City
This is one of the most striking findings so far. Chicago and New York log a dramatically higher proportion of "High Risk" violations than Boston and San Francisco. In Chicago, "High Risk" appears to be the most common category, while in Boston, "Low Risk" is dominant. This may reflect different public health definitions of "risk" or different areas of focus for each city's inspectors.

Plot 6 & 7: Score Distributions
These two plots (which we've combined for analysis) confirm our inspection_outcome logic and reveal the inverted nature of the scoring systems.

NYC uses a demerit system: A lower score is better. Our plot shows that "Pass" (Grade A) has the lowest scores, while "Fail" (Grade C) has the highest.

SF uses a points-based system: A higher score is better. Our plot confirms this, showing "Pass" (90+) has the highest scores and "Fail" (71-85) has lower scores.

This analysis validates our standardization and proves that the two scoring systems are direct opposites.

### Exploring Fail Rates Over Time 
```{r Exploring Fail Rates}

# Create a summary table
fails_by_year <- inspections |>
  filter(inspection_date > "2000-01-01") |> # filter out 'bad' dates
  mutate(year = year(inspection_date)) |>
  group_by(city_source, year) |>
  summarize(
    fail_rate = mean(inspection_outcome == "Fail", na.rm = TRUE),
    total_inspections = n()
  ) |>
  filter(total_inspections > 100) # Remove years with tiny data

# Plot the summary
ggplot(fails_by_year, aes(x = year, y = fail_rate, color = city_source)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  scale_y_continuous(labels = label_percent()) +     # format y-axis as percentage
  scale_color_viridis_d() +                         # add better colors
  labs(
    title = "Fail Rate Over Time by City", 
    y = "Proportion of Fails",                      # <-- Fix y-axis label
    x = "Year",
    color = "City"                                  # <-- Fix legend title
  )

# Least Successful zipcodes in each city in the dataset
fails_by_zip <- inspections |>
  filter(inspection_date > "2000-01-01") |> # Good idea to filter bad dates
  group_by(city_source, zip_code) |>
  summarize(
    fail_rate = mean(inspection_outcome == "Fail", na.rm = TRUE),
    total_inspections = n()
  ) |>
  filter(total_inspections > 100) |> # Only include zips with 100+ inspections
  ungroup()

# Summary table for zip codes
fails_by_zip |>
  group_by(city_source) |>      # Group by city
  top_n(3, fail_rate) |>        # Find the top 3 *within* that city
  ungroup() |>
  
  # This 'reorder_within' is a special trick to make facets work
  mutate(
    zip_code = tidytext::reorder_within(zip_code, fail_rate, city_source)
  ) |>

  ggplot(aes(x = zip_code, y = fail_rate, fill = city_source)) +
  geom_col() +
  tidytext::scale_x_reordered() + # This makes the labels plot correctly
  facet_wrap(~ city_source, scales = "free") + #Create 4 separate charts
  coord_flip() +
  labs(
    title = "Top 3 Zip Codes with Highest Fail Rate by City",
    x = "Zip Code",
    y = "Fail Rate"
  ) +
  theme(legend.position = "none")

```

Following our city-level comparisons, we moved into our first multivariate analysis to answer our research questions about where and when violations occur.

Plot 8: Fail Rate Over Time by City
This line graph plots the proportion (percentage) of "Fail" outcomes for each city, from 2008 to 2025. This is our most significant finding so far:

Dramatically Different Systems: The definition of a "Fail" is clearly not standardized.

Boston: Has a very high and consistent "Fail" rate, hovering around 60% for the entire 15-year period. This strongly suggests that Boston's result column logs a "Fail" for even minor infractions, making it a very common outcome.

Chicago: Shows a positive trend. Its fail rate has steadily decreased from ~25% in 2010 to a stable ~18-20% in recent years.

New York: Has an extremely low fail rate, consistently staying below 10%. This aligns perfectly with their "Grade A" system, where a 'C' (which we mapped to "Fail") is rare.

San Francisco: Appears to have a fail rate of around 30%, landing between Boston and the other two cities.

Plot 9: Top 3 Worst Zip Codes by City
This bar chart was intended to find the "worst" zip codes across the entire dataset.
We have now successfully identified specific geographic areas of concern for each city:

Boston: 02119-3212, 02120

Chicago: 60827, 60619

New York: 11436, 11428

San Francisco: 94134, 94118

### Boston Zipcodes Map

```{r eval=FALSE, include=FALSE}
bos_insp <- inspections |>
  filter(city_source == "Boston")
bos_shp <- st_read("Boston Shapefiles/ZIP_Codes.shp")

bos_borders <- bos_shp |> st_union() |>
  st_buffer(.5)
ggplot(bos_borders) +
  geom_sf(data=bos, aes(fill = ZIP5))


# Density map of restaurants in each zipcode

rstnt_density <- bos_insp |>
  group_by(restaurant_name) |>
  group_by(zip_code) |>
  mutate(n_per_zip = N())

bos$zip_code <- bos$ZIP5

bos <- full_join(rstnt_density, bos, by = zip_code)

ggplot(bos_borders) +
  geom_sf(data=bos, aes(fill_n_per_zip))
```

There are 43 zip codes in the City of Boston

### Top Violations Overall

```{r Top Violations Overall}
top_bigrams <- inspections |>
  sample_frac(0.5) |>  # Taking a 50% random sample to reduce memory load
  filter(!is.na(violation_desc)) |>
  unnest_tokens(bigram, violation_desc, token = "ngrams", n = 2) |>
  count(bigram, sort = TRUE)

# Plot the top 20
top_bigrams |>
  top_n(20) |>
  ggplot(aes(x = reorder(bigram, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Top 20 Most Common Violation Phrases (50% Sample)",
    x = "Violation Phrase (Bigram)",
    y = "Total Count"
  )
```

This plot, despite successfully processing the text, shows that the most common phrases across all cities are generic, high-frequency functional words like 'food contact' and 'non food'. We do not see specific issues like 'rodent' or 'improper temperature' in the top 20, confirming that these key issues are diluted by common, general phrases. This validates our next step to use manual keyword searching to group thousands of unique violations into meaningful categories.

### Days of the week that see the most inspections

```{r Inspections by Day of Week}
inspections |> 
  mutate(wday = wday(inspection_date, label = TRUE)) |> 
  count(wday) |>
  ggplot(aes(wday, n)) +
  geom_col() +
  scale_y_continuous(labels = label_comma()) +
  labs(title = "Inspections by Day of Week")
```

This analysis confirms that inspection frequency is heavily skewed toward weekdays, with a slight peak on Tuesday. The low counts on Sunday and Saturday  suggest that weekend inspections are rare or reserved only for specific facilities.

### Inspection Frequency per Restaurant

```{r Inspections per Restaurant}
ggplot(inspections |> count(restaurant_name), aes(n)) +
  geom_histogram(breaks = seq(0, 50, by = 1), color="black", fill="skyblue") +
  labs(title="Distribution of Inspections per Restaurant", x="Number of Inspections", y="Number of Restaurants")
```

The tallest bar shows that nearly 10,000 restaurants have been inspected only once (or close to once) during the dataset's time frame. The number of restaurants drops off steeply for businesses inspected just 2, 3, or 4 times.
The long, flat tail on the right is highly significant. While most businesses are inspected infrequently, a small but consistent number of restaurants are inspected 20, 30, 40, or even 50+ times. This may be due to routine inspections happening, but this is unlikely for the time period covered in our data. This coudl also suggest that the inspection system is effectively using a risk-based approach. Restaurants that fail or have serious violations are likely pushed into re-inspection cycles or receive more frequent routine inspections. Conversely, successful restaurants are inspected less often, freeing up resources.

## Answering Research Questions

## Seun's Analysis 
### Categorize Violations
```{r Categorize Violations}
inspections_categorized <- inspections |>
  mutate(
    violation_category = case_when(
      str_detect(violation_desc, "mice|rodent|droppings|insects|pests|harborage") ~ "Pest Control",
      str_detect(violation_desc, "temperature|cool|hot-holding|cold-holding|refrigerat") ~ "Food Temperature",
      str_detect(violation_desc, "wash|sanitiz|clean|wiping cloths|dishwashing") ~ "Sanitation & Cleaning",
      str_detect(violation_desc, "hand wash|hygiene|no hair|bare hand") ~ "Employee Hygiene",
      str_detect(violation_desc, "contaminat|cross-contamination|raw|covered|protect") ~ "Cross-Contamination",
      str_detect(violation_desc, "floor|wall|ceiling|plumbing|ventilation|garbage") ~ "Facility/Maintenance",
      TRUE ~ "Other"
    )
  )

#Check new categories
inspections_categorized |>
  count(violation_category, sort = TRUE)
```


### Categorize Restaurants
```{r Categorize Restaurants}
inspections_categorized <- inspections_categorized |>
  mutate(
    # Create one combined description column to search
    desc_all = coalesce(cuisine_description, facility_type, descript) |> tolower(),
    
    restaurant_category = case_when(
      str_detect(desc_all, "pizza|pizzeria") ~ "Pizza",
      str_detect(desc_all, "sushi|japanese") ~ "Sushi/Japanese",
      str_detect(desc_all, "coffee|cafe|bakery|donut") ~ "Coffee/Bakery",
      str_detect(desc_all, "mexican|taco") ~ "Mexican",
      str_detect(desc_all, "chinese") ~ "Chinese",
      str_detect(desc_all, "sandwich|deli|sub") ~ "Sandwich/Deli",
      str_detect(desc_all, "bar|tavern|pub") ~ "Bar/Pub",
      str_detect(desc_all, "restaurant|american") ~ "Restaurant (General)",
      str_detect(desc_all, "school|grocery|market|caterer|kitchen") ~ "Other (Grocery/School/Etc.)",
      TRUE ~ "Other"
    )
  )

#Check new categories
inspections_categorized |>
  count(restaurant_category, sort = TRUE)
```

### Violation "Fingerprint" Plot
```{r Violation "Fingerprint" Plot}
#Summary table
violation_fingerprint <- inspections_categorized |>
  filter(
    restaurant_category != "Other",
    restaurant_category != "Other (Grocery/School/Etc.)",
    violation_category != "Other"
  ) |>
  count(restaurant_category, violation_category) |>
  group_by(restaurant_category) |>
  mutate(proportion = n / sum(n)) |>
  ungroup()

# Plot the "fingerprint"
ggplot(violation_fingerprint, 
       aes(x = violation_category, y = proportion, fill = violation_category)) +
  geom_col() +
  facet_wrap(~ restaurant_category) + 
  scale_y_continuous(labels = label_percent()) +
  theme(
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    legend.position = "bottom"
  ) +
  labs(
    title = "Violation 'Fingerprint' by Restaurant Category",
    y = "Proportion of All Violations",
    x = "Violation Type",
    fill = "Violation Category"
  )
```

Restaurant categories do disproportionately receive certain violation types. While "Sanitation & Cleaning" is a universal problem, specific models carry unique risks. 
The most important takeaway is that Sanitation & Cleaning (pink) and Pest Control (blue) are the two largest violation categories across nearly all restaurant types. However, the proportion of other key categories reveals important differences:

Pest Control Dominance:

Sushi/Japanese, Sandwich/Deli, and Bar/Pubs have the highest proportion of Pest Control violations, consistently making up over 25% of their total issues.

Sanitation & Cleaning Focus:

Coffee/Bakery, Pizza, and Chinese restaurants are highly dominated by Sanitation & Cleaning violations (pink), with this category often exceeding 40% of their total violations.

Food Temperature Risk (The Outlier):

Restaurant (General) is the only category where Food Temperature (teal) makes up the largest non-Sanitation/Pest share, hitting approximately 25–30% of its violations. This suggests that businesses with diverse menus and complex cooking processes are uniquely prone to temperature-related failures.

Cross-Contamination Risk:

Mexican and Sandwich/Deli restaurants show a slightly elevated proportional risk for Cross-Contamination (red), often related to the high volume of raw preparation (e.g., cutting vegetables, preparing salsas, handling deli meats).

### Standardize Inspection Types
```{r Standardize Inspection Types}
inspections_categorized <- inspections_categorized |>
  mutate(
    type_lower = tolower(inspection_type),
    
    inspection_type_standard = case_when(
      # Keywords for complaint
      str_detect(type_lower, "complaint|311|owner complaint") ~ "Complaint",
      
      # Keywords for routine
      str_detect(type_lower, "routine|cycle|canvass|regular") ~ "Routine",
      
      # Keywords for follow-ups
      str_detect(type_lower, "re-inspection|reinspection|follow-up") ~ "Follow-Up",
      
      TRUE ~ "Other"
    )
  )

#new standardized categories
inspections_categorized |>
  count(inspection_type_standard, sort = TRUE)
```

### Plot Severity by Inspection Type
```{r Plot Severity by Inspection Type}

inspections_categorized |>
  # 1. Filter the data (No changes needed here)
  filter(
    inspection_type_standard %in% c("Routine", "Complaint"),
    risk_level_standard != "Unknown"
  ) |> # <--- PIPE TO THE NEXT STEP
  
  # 2. Mutate (Convert to Factor)
  mutate(
    risk_level_standard = factor(
      risk_level_standard, 
      # Set the desired hierarchical order
      levels = c("High", "Medium", "Low"),
      ordered = TRUE
    )
  ) |> # <--- CORRECT PIPE PLACEMENT (Outside the final mutate parenthesis)
  
  # 3. Plot proportions
  ggplot(aes(x = inspection_type_standard, fill = risk_level_standard)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::label_percent()) +
  scale_fill_viridis_d(option = "C") +
  labs(
    title = "Violation Severity by Inspection Type",
    x = "Inspection Type",
    y = "Proportion of Violations",
    fill = "Risk Level"
  )
```

The proportional plot clearly shows that Complaint-Driven inspections are more likely to find severe problems than Routine inspections12. The proportion of High Risk violations is visually larger in the Complaint column than in the Routine column13. This suggests the public is effective at reporting restaurants that pose an immediate public health risk.

## Plot Violation Type by Inspection Type
```{r Plot Violation Type by Inspection Type}
inspections_categorized |>
  filter(
    inspection_type_standard %in% c("Routine", "Complaint"),
    violation_category != "Other" # Remove 'Other' violations
  ) |>
  
  # Plot proportions
  ggplot(aes(x = inspection_type_standard, fill = violation_category)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::label_percent()) +
  scale_fill_viridis_d() +
  labs(
    title = "Violation Types by Inspection Type",
    x = "Inspection Type",
    y = "Proportion of Violations",
    fill = "Violation Category"
  ) +
  coord_flip() # Flip axes to make categories easier to read
```

When looking at violation type, Complaint inspections are disproportionately focused on Pest Control and Food Temperature issues. In contrast, Routine inspections find a much higher proportion of Sanitation & Cleaning and Facility/Maintenance problems. This suggests that patrons are quick to report visible signs of pests or temperature issues, while inspectors find more structural/hygiene problems during unannounced, comprehensive checks.

Our analysis confirms two critical relationships:

1.  **Business Model Risk:** The type of cuisine a restaurant serves creates a unique "fingerprint" of health risks (Q1). Complex cooking (Restaurant General) correlates with **Food Temperature** issues, while high-volume raw preparation (Sushi/Deli) correlates with **Pest Control** and **Cross-Contamination**.
2.  **Inspection Type Bias:** Public complaints (Q2) are highly effective at flagging immediate public health threats, demonstrating a higher proportional incidence of **High Risk** violations and **Pest Control** problems compared to routine inspections.